Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


Studies on the consequences of variants focus on understanding the molecular mechanisms and pathways that link a genotype to a phenotype. Studies of this kind interpret the consequences of variants on protein function. Simple base substitutions, such as missense, stop-gained, and stop-lost variants, can alter the translated protein sequence and thereby have functional consequences. Moreover, the functional consequences of structural variants are usually defined by the observed physiological phenotypes. These descriptions can be complex and are expressed in terms of general phenotypic traits rather than the specific biochemical effects of the variant. Clinical functional consequences are represented by a simple controlled vocabulary that defines the relative pathogenicity of a variant: benign, likely benign, uncertain significance, likely pathogenic, or pathogenic.
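The effect of a single base substitution on the translated protein can be illustrated with a minimal Python sketch. It assumes the standard genetic code and compares one reference codon with one altered codon. The function name `classify_substitution` and the label strings are illustrative choices, not part of any particular annotation tool, and the sketch ignores start codons, splice sites, and other context that real annotators take into account.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid.

    Hypothetical helper for illustration only; '*' denotes a stop codon.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop retained)
    if alt_aa == "*":
        return "stop_gained"  # premature stop codon introduced
    if ref_aa == "*":
        return "stop_lost"    # stop codon replaced by an amino acid
    return "missense"         # one amino acid replaced by another


if __name__ == "__main__":
    for ref, alt in [("GAG", "GTG"), ("TGG", "TGA"), ("TAA", "CAA"), ("CTT", "CTC")]:
        print(ref, "->", alt, ":", classify_substitution(ref, alt))
```

Here GAG to GTG replaces glutamic acid with valine, so it is reported as missense, whereas TGG to TGA converts a tryptophan codon into a stop codon.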

Population genetics is the study of variation within populations of individuals and of the forces that shape it. This usually involves studying changes in the frequencies of genetic variants in populations over space and time. The major forces that shape variation in natural populations include mutation, selection, migration, and random genetic drift. A new mutation may be beneficial to the organism, deleterious (harmful), or neutral (having no effect on fitness). Beneficial and deleterious mutations are subject to natural selection, which typically increases the allele frequency of the former and decreases that of the latter. Allele frequencies are also influenced by random genetic drift, the chance fluctuation of allele frequencies from one generation to the next that arises because each generation inherits only a finite sample of the previous generation's alleles.
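Random genetic drift can be made concrete with a small simulation. The sketch below assumes a Wright-Fisher-style model: a diploid population of constant size, non-overlapping generations, and no selection, mutation, or migration. The function name `simulate_drift` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def simulate_drift(pop_size, init_freq, generations, seed=None):
    """Return the allele-frequency trajectory under pure genetic drift.

    Each generation, 2 * pop_size allele copies are drawn at random,
    each carrying the allele with probability equal to its current
    frequency (binomial sampling).
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size  # diploid: two allele copies per individual
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    # Smaller populations show larger generation-to-generation fluctuations.
    for size in (10, 1000):
        path = simulate_drift(pop_size=size, init_freq=0.5, generations=20, seed=1)
        print(size, [round(f, 2) for f in path])
```

Because no selection is applied, any change in frequency between generations in this sketch is due to sampling alone. An allele that reaches a frequency of 0 or 1 stays there, which corresponds to loss or fixation.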

4.2 VARIANT CALLING PROGRAMS

Several programs are available for variant calling, and they implement different algorithms. The most commonly used variant callers fall into two groups: consensus-based callers such as BCFtools mpileup, and haplotype-based callers such as FreeBayes and GATK HaplotypeCaller. In the following, we will discuss these two types of variant callers with some examples. We will assume that the FASTQ files used in the exercises have been preprocessed and cleaned as explained in Chapter 1.

4.2.1 Consensus-Based Variant Callers

Consensus-based variant callers rely on the pileup of aligned reads covering a position on the reference sequence to call variants (SNVs or InDels). The read alignment information is stored in a SAM/BAM file, from which we can examine the pileup of all read bases covering a given reference position. In most cases, the bases covering that position match the reference base; at a variant site, they differ from it. The consensus sequence is created by collapsing the bases at each position and choosing the most frequent base. At some positions, the consensus sequence may differ from the sequence of the reference genome. Such differences can also be due to sequencing errors; however, sufficient sequencing depth provides enough confidence to call the variants. Figure 4.2 shows a diagram for reads aligned to the sequence of a reference genome and a consensus sequence formed by